sigident - Howto Microarray

library(sigident)
#> 
#> Attaching package: 'sigident'
#> The following object is masked from 'package:methods':
#> 
#>     signature
library(knitr)

Prerequisites and installation

Install sigident package

Preprocessing

Data preparation

In order to use this R package and its functions, you need to prepare a merged gene expression dataset manually. The workflow to achieve this is presented in the following.

Download of GEO datasets and normalization

This example uses the GEO lung cancer studies “GSE18842”, “GSE19804” and “GSE19188”, which contain in total 367 samples (197 tumor; 170 non-tumor) and 54,675 transcripts obtained by using Affymetrix GeneChip Human Genome U133 Plus 2.0 Array (platform GPL570).

To use microarray datasets from the GEO database, the studies can be downloaded directly by executing R code. Before downloading the GEO studies, the working directory must be initialized and the required folders created. The object targetcol should be set to “target” (corresponding to the column name of the object sample_metadata below); it names the outcome variable of interest for the subsequent analyses. The levels of targetcol are defined via controlname and targetname. Since in this example we use the package to analyze DEGs and to identify diagnostic and prognostic gene signatures separating two groups (subtypes, healthy vs. cancer, etc.), we set controlname and targetname to “Control” and “Lung Cancer”.
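
This initialization step could look like the following sketch. The directory names mirror the objects used throughout this vignette (maindir, datadir, plotdir, csvdir), but placing them under tempdir() is an illustrative assumption, not a package requirement:

```r
# Illustrative setup of working directories (locations are assumptions)
maindir <- file.path(tempdir(), "geodata")   # raw data (CEL files) go here
datadir <- file.path(maindir, "data")        # downloaded GEO series go here
plotdir <- file.path(tempdir(), "plots")     # output folder for plots
csvdir  <- file.path(tempdir(), "csv")       # output folder for result tables

for (d in c(maindir, datadir, plotdir, csvdir)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# outcome definition used throughout this vignette
targetcol   <- "target"
controlname <- "Control"
targetname  <- "Lung Cancer"
```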

Insert the GEO accession numbers of the desired GEO studies into the getGEO() function. This will download the datasets containing expression, pheno and feature data as well as the annotations of the probe sets, and store them as ‘.txt.gz’ files in the defined datadir folder.

Moreover, this approach also enables us to download the raw data in CEL file format. These files need to be uncompressed; subsequently, a GCRMA normalization has to be conducted.

For this, further commands are necessary to import the CEL files into the R environment. eset_c represents a normalized ExpressionSet, though it only contains expression data without pheno or feature data; these can be added later. (The study GSE19188 could also be downloaded directly via getGEO() instead of using the raw data and manually adding pheno and feature data afterwards. We just wanted to demonstrate how to use raw data in the form of CEL files and apply normalization to it.)

# download raw data, uncompress them and execute GCRMA normalization
# actually not necessary for GSE19188 but to showcase feasibility of dealing with .CEL files
GEOquery::getGEOSuppFiles("GSE19188", baseDir = filePath)
utils::untar(paste0(filePath, "/GSE19188/GSE19188_RAW.tar"), exdir=maindir)
cels <- list.files(maindir, pattern = "\\.gz$")
tryCatch({
  sapply(paste(maindir, cels, sep="/"), GEOquery::gunzip)
  # at this point, the text file "phenodata.txt" must first be generated
  ## For downloading raw data use the function: getGEOSuppFiles("").
  ## After downloading and uncompressing the raw file (*_RAW.tar) we need to capture
  ## the experimental information. Therefore, a text file is created on the
  ## command line.
  system("bash -c 'ls ./geodata/*.CEL > ./geodata/phenodata.txt'")
  ## The command 'ls ./geodata/*.CEL > ./geodata/phenodata.txt' puts all .CEL
  ## files into one text file.
  ## After this step the normalization can be started.
}, error = function(e){
  print(e)
})

cels <- list.files(maindir, pattern = "CEL$")

celfiles <- file.path(maindir, cels)
# use justGCRMA now due to memory issues with gcrma --> justGCRMA is much more memory efficient
# https://www.rdocumentation.org/packages/gcrma/versions/2.44.0/topics/justGCRMA
eset_c <- gcrma::justGCRMA(filenames = celfiles, fast = TRUE)
# call garbage collection here
gc()

# old code
#celfiles <- affy::ReadAffy(filenames = gsub("\\.gz", "", cels), celfile.path = "./geodata")
#eset_c <- gcrma::gcrma(celfiles, fast = TRUE)

# eset_c = now is a normalized ExpressionSet, but without f- and pData
# the f- and pData are transferred manually from dataset GSE19188

GSE19188 <- GEOquery::getGEO("GSE19188", destdir = datadir)

eset3 <- GSE19188[[1]]
pData3 <- Biobase::pData(eset3)
fData3 <- Biobase::fData(eset3)
Biobase::pData(eset_c) <- pData3
Biobase::fData(eset_c) <- fData3
colnames(Biobase::exprs(eset_c)) <- colnames(Biobase::exprs(eset3))
Biobase::annotation(eset_c) <- Biobase::annotation(eset3)
eset3 <- eset_c

Formatting the phenoData

The phenoData of the ExpressionSets need to be formatted so that the phenoData columns overlap perfectly. This means that all ExpressionSets should contain identical columns in the same order; inconsistent columns need to be removed. For a simpler procedure, we recommend retaining only the phenoData columns required for the downstream analysis (e.g. targetcol and survival data).
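
The alignment rule can be sketched on plain data frames: keep only the columns present in every dataset, in one fixed order (toy data, not the GEO phenoData):

```r
# Toy phenoData-like data frames with partly overlapping columns
pd1 <- data.frame(title = "a", target = "Control",     relation = "x")
pd2 <- data.frame(title = "b", target = "Lung Cancer", stage = "I")
pd3 <- data.frame(target = "Control", title = "c")

# columns present in all three sets, in the order of the first set
common <- Reduce(intersect, list(colnames(pd1), colnames(pd2), colnames(pd3)))

# subset each data frame to the common columns, identical order everywhere
pd1 <- pd1[, common, drop = FALSE]
pd2 <- pd2[, common, drop = FALSE]
pd3 <- pd3[, common, drop = FALSE]
```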

# create version b, so that the original ExpressionSets persist
eset1b <- eset1
eset2b <- eset2
eset3b <- eset3

# show number of samples and phenoData columns
dim(Biobase::pData(eset1b))
#> [1] 91 33
dim(Biobase::pData(eset2b))
#> [1] 120  37
dim(Biobase::pData(eset3b))
#> [1] 156  44
# depict phenoData column names
colnames(Biobase::pData(eset1b))
#>  [1] "title"                   "geo_accession"          
#>  [3] "status"                  "submission_date"        
#>  [5] "last_update_date"        "type"                   
#>  [7] "channel_count"           "source_name_ch1"        
#>  [9] "organism_ch1"            "characteristics_ch1"    
#> [11] "characteristics_ch1.1"   "molecule_ch1"           
#> [13] "extract_protocol_ch1"    "label_ch1"              
#> [15] "label_protocol_ch1"      "taxid_ch1"              
#> [17] "hyb_protocol"            "scan_protocol"          
#> [19] "description"             "data_processing"        
#> [21] "platform_id"             "contact_name"           
#> [23] "contact_email"           "contact_institute"      
#> [25] "contact_address"         "contact_city"           
#> [27] "contact_zip/postal_code" "contact_country"        
#> [29] "supplementary_file"      "data_row_count"         
#> [31] "relation"                "sample type:ch1"        
#> [33] "tissue:ch1"
colnames(Biobase::pData(eset2b))
#>  [1] "title"                   "geo_accession"          
#>  [3] "status"                  "submission_date"        
#>  [5] "last_update_date"        "type"                   
#>  [7] "channel_count"           "source_name_ch1"        
#>  [9] "organism_ch1"            "characteristics_ch1"    
#> [11] "characteristics_ch1.1"   "characteristics_ch1.2"  
#> [13] "characteristics_ch1.3"   "molecule_ch1"           
#> [15] "extract_protocol_ch1"    "label_ch1"              
#> [17] "label_protocol_ch1"      "taxid_ch1"              
#> [19] "hyb_protocol"            "scan_protocol"          
#> [21] "description"             "data_processing"        
#> [23] "platform_id"             "contact_name"           
#> [25] "contact_email"           "contact_department"     
#> [27] "contact_institute"       "contact_address"        
#> [29] "contact_city"            "contact_zip/postal_code"
#> [31] "contact_country"         "supplementary_file"     
#> [33] "data_row_count"          "age:ch1"                
#> [35] "gender:ch1"              "stage:ch1"              
#> [37] "tissue:ch1"
colnames(Biobase::pData(eset3b))
#>  [1] "title"                   "geo_accession"          
#>  [3] "status"                  "submission_date"        
#>  [5] "last_update_date"        "type"                   
#>  [7] "channel_count"           "source_name_ch1"        
#>  [9] "organism_ch1"            "characteristics_ch1"    
#> [11] "characteristics_ch1.1"   "characteristics_ch1.2"  
#> [13] "characteristics_ch1.3"   "characteristics_ch1.4"  
#> [15] "molecule_ch1"            "extract_protocol_ch1"   
#> [17] "label_ch1"               "label_protocol_ch1"     
#> [19] "taxid_ch1"               "hyb_protocol"           
#> [21] "scan_protocol"           "description"            
#> [23] "data_processing"         "platform_id"            
#> [25] "contact_name"            "contact_email"          
#> [27] "contact_phone"           "contact_fax"            
#> [29] "contact_department"      "contact_institute"      
#> [31] "contact_address"         "contact_city"           
#> [33] "contact_zip/postal_code" "contact_country"        
#> [35] "contact_web_link"        "supplementary_file"     
#> [37] "data_row_count"          "relation"               
#> [39] "relation.1"              "cell type:ch1"          
#> [41] "gender:ch1"              "overall survival:ch1"   
#> [43] "status:ch1"              "tissue type:ch1"


# remove unnecessary phenoData
# rename desired column names and if necessary the column entries (as we did for 'Tumor')

# eset1b
eset1b$relation <- NULL
eset1b$characteristics_ch1 <- NULL
eset1b$`sample type:ch1` <- NULL
eset1b$`tissue:ch1` <- NULL
colnames(Biobase::pData(eset1b))[8] = targetcol
levels(eset1b[[targetcol]]) = c(controlname, targetname)
dim(eset1b)
#> Features  Samples 
#>    54675       91


# eset2b
eset2b$characteristics_ch1.2 <- NULL
eset2b$characteristics_ch1.3 <- NULL
eset2b$contact_department <- NULL
eset2b$characteristics_ch1 <- NULL
eset2b$`age:ch1` <- NULL
eset2b$`gender:ch1` <- NULL
eset2b$`stage:ch1` <- NULL
eset2b$`tissue:ch1` <- NULL
colnames(Biobase::pData(eset2b))[8] = targetcol
levels(eset2b[[targetcol]]) = c(controlname, targetname)
dim(eset2b)
#> Features  Samples 
#>    54675      120


# eset3b
eset3b$characteristics_ch1.2 <- NULL
eset3b$characteristics_ch1.3 <- NULL
eset3b$characteristics_ch1.4 <- NULL
eset3b$contact_phone <- NULL
eset3b$contact_fax <- NULL
eset3b$contact_department <- NULL
eset3b$contact_web_link <- NULL
eset3b$relation <- NULL
eset3b$source_name_ch1 <- NULL
colnames(Biobase::pData(eset3b))[9] = targetcol
levels(eset3b[[targetcol]]) = c(controlname, targetname)
pNew <- Biobase::pData(eset3b)[,c(1:7,9,8,10:29)] # reorder columns
Biobase::pData(eset3b) = pNew
dim(eset3b)
#> Features  Samples 
#>    54675      156

# now, all phenoData columns are consistent
cbind(colnames(Biobase::pData(eset1b)), colnames(Biobase::pData(eset2b)), 
      colnames(Biobase::pData(eset3b)))
#>       [,1]                      [,2]                     
#>  [1,] "title"                   "title"                  
#>  [2,] "geo_accession"           "geo_accession"          
#>  [3,] "status"                  "status"                 
#>  [4,] "submission_date"         "submission_date"        
#>  [5,] "last_update_date"        "last_update_date"       
#>  [6,] "type"                    "type"                   
#>  [7,] "channel_count"           "channel_count"          
#>  [8,] "target"                  "target"                 
#>  [9,] "organism_ch1"            "organism_ch1"           
#> [10,] "characteristics_ch1.1"   "characteristics_ch1.1"  
#> [11,] "molecule_ch1"            "molecule_ch1"           
#> [12,] "extract_protocol_ch1"    "extract_protocol_ch1"   
#> [13,] "label_ch1"               "label_ch1"              
#> [14,] "label_protocol_ch1"      "label_protocol_ch1"     
#> [15,] "taxid_ch1"               "taxid_ch1"              
#> [16,] "hyb_protocol"            "hyb_protocol"           
#> [17,] "scan_protocol"           "scan_protocol"          
#> [18,] "description"             "description"            
#> [19,] "data_processing"         "data_processing"        
#> [20,] "platform_id"             "platform_id"            
#> [21,] "contact_name"            "contact_name"           
#> [22,] "contact_email"           "contact_email"          
#> [23,] "contact_institute"       "contact_institute"      
#> [24,] "contact_address"         "contact_address"        
#> [25,] "contact_city"            "contact_city"           
#> [26,] "contact_zip/postal_code" "contact_zip/postal_code"
#> [27,] "contact_country"         "contact_country"        
#> [28,] "supplementary_file"      "supplementary_file"     
#> [29,] "data_row_count"          "data_row_count"         
#>       [,3]                     
#>  [1,] "title"                  
#>  [2,] "geo_accession"          
#>  [3,] "status"                 
#>  [4,] "submission_date"        
#>  [5,] "last_update_date"       
#>  [6,] "type"                   
#>  [7,] "channel_count"          
#>  [8,] "target"                 
#>  [9,] "organism_ch1"           
#> [10,] "characteristics_ch1.1"  
#> [11,] "molecule_ch1"           
#> [12,] "extract_protocol_ch1"   
#> [13,] "label_ch1"              
#> [14,] "label_protocol_ch1"     
#> [15,] "taxid_ch1"              
#> [16,] "hyb_protocol"           
#> [17,] "scan_protocol"          
#> [18,] "description"            
#> [19,] "data_processing"        
#> [20,] "platform_id"            
#> [21,] "contact_name"           
#> [22,] "contact_email"          
#> [23,] "contact_institute"      
#> [24,] "contact_address"        
#> [25,] "contact_city"           
#> [26,] "contact_zip/postal_code"
#> [27,] "contact_country"        
#> [28,] "supplementary_file"     
#> [29,] "data_row_count"

Preparations for utilizing the sigident package

Before using the sigident package, these variables need to be defined properly. One could also pass them directly as arguments to the respective functions; however, to keep the code clearer, we define them here at the beginning and refer to the respective variable in each function call. idtype can be either “affy” or “entrez”, indicating whether Affy IDs or Entrez IDs are to be used as feature names for the subsequent analyses.
Selecting “entrez” usually results in a reduction of total features due to the removal of empty strings and duplicate IDs.
Here we use idtype = “affy” for all subsequent analyses.

Extract metadata

The metadata are extracted directly from the ExpressionSets and deposited, for easier use, in the data frame sample_metadata. This data frame consists of the GEO series accession numbers of the studies (GSExxx), the sample accession numbers (GSMxxx) and the targetcol (named “target”) containing the outcome of interest (currently, only a binary coded target variable is supported for the subsequent analyses, e.g. “Cancer - No Cancer”).
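
A minimal sample_metadata of this shape might look as follows; the exact column names and the sample accessions are illustrative assumptions, only the "target" column name is fixed by targetcol:

```r
# Illustrative sample_metadata (sample accessions are made up)
sample_metadata <- data.frame(
  study  = c("GSE18842", "GSE18842", "GSE19804"),
  sample = c("GSM0000001", "GSM0000002", "GSM0000003"),
  target = c("Lung Cancer", "Control", "Control"),
  stringsAsFactors = FALSE
)
```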

Define study metadata

study_metadata contains the GEO series accession numbers of the studies and their assignment as discovery or validation dataset.
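
Such a study_metadata object could be sketched like this; the column names are assumptions, while the study accessions and their role as discovery datasets follow the example above:

```r
# Illustrative study_metadata for the three discovery studies
study_metadata <- data.frame(
  study = c("GSE18842", "GSE19804", "GSE19188"),
  type  = "discovery",          # vs. "validation" for validation studies
  stringsAsFactors = FALSE
)
```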

Merging and batch correction

In this merging framework, only genes that are present in all datasets are preserved. In our example, the entire probe set of 54,675 transcripts remains. mergedset contains the samples of the studies grouped as discovery and represents an “ExpressionSet” object describing samples, annotations and further information about the features.
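
The merging rule in miniature: only features (rows) shared by all expression matrices survive, and the samples (columns) are concatenated. This toy sketch is not the package's implementation, only the core idea:

```r
set.seed(42)
# two tiny expression matrices with partly overlapping features
m1 <- matrix(rnorm(6), nrow = 3,
             dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
m2 <- matrix(rnorm(6), nrow = 3,
             dimnames = list(c("g2", "g3", "g4"), c("s3", "s4")))

# keep only genes present in both studies, then bind the samples
shared <- intersect(rownames(m1), rownames(m2))
merged <- cbind(m1[shared, ], m2[shared, ])
```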

Discovering batch effects

Batch effects are systematic non-biological variations between studies caused by experimental and technical artifacts. As a first visualization, a boxplot is created with the included samples on the x-axis and the expression values on the y-axis. Considerable discrepancies between the three studies can already be recognized here.

Visualize batch effect

A more powerful approach for batch effect detection is to conduct a guided Principal Component Analysis (gPCA), implemented in the gPCA package.

In order to correct for occurring batch effects and other unwanted variation in high-throughput experiments, the ComBat function from the sva package is applied inside the function batch_correction(). The ComBat function adjusts for known batches using an empirical Bayesian framework [1].
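
As a rough intuition only: ComBat estimates batch parameters and shrinks them via empirical Bayes before adjusting the data. A naive per-gene, per-batch location adjustment (deliberately simpler than ComBat, which also handles batch-specific variances) looks like:

```r
set.seed(1)
expr  <- matrix(rnorm(40), nrow = 4)               # 4 genes x 10 samples
batch <- rep(c("A", "B"), each = 5)
expr[, batch == "B"] <- expr[, batch == "B"] + 2   # simulate an additive batch shift

# naive adjustment: recenter each gene within each batch to its overall mean;
# ComBat additionally shrinks batch parameters via empirical Bayes
adjusted <- expr
for (b in unique(batch)) {
  idx <- batch == b
  adjusted[, idx] <- expr[, idx] - rowMeans(expr[, idx]) + rowMeans(expr)
}
```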

mergeset is the output of the merging approach described above and represents a matrix containing batch-corrected expression data with genes in the rows and samples in the columns; it is no longer an ExpressionSet.

Visualize batch effect removal

Creating a boxplot and a gPCA plot from mergeset shows that no batch effect is present anymore.

DEG analysis

A common task preceding expression data analysis is to perform a statistical analysis to discover quantitative changes in expression levels between experimental groups. For this purpose we offer the identification of differentially expressed genes (DEGs) based on the limma package. In order to correct for multiple testing, we adjust the p-values with the “BH” method, which controls the expected false discovery rate (FDR).
A heatmap based on the selected DEGs is created, in which clear differences in the expression profiles between the groups (“Controls” [blue] and “Lung Cancer” [red]) can be recognized.
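
The “BH” adjustment can be reproduced in base R with p.adjust(); a gene counts as a DEG when its adjusted p-value falls below the chosen FDR threshold (the p-values below are invented for illustration):

```r
# Benjamini-Hochberg adjustment of raw limma p-values (invented values)
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.2)
q <- p.adjust(p, method = "BH")   # controls the expected FDR

# keep a gene as DEG if its adjusted p-value is below the fdr cutoff
fdr  <- 0.05
keep <- q < fdr
```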

# define the false discovery rate here
fdr <- 0.05

genes <- sigident::identify_degs(mergeset = mergeset,
                                 design = design,
                                 q_value = fdr)

# heatmap creation
filename <- paste0(plotdir, "DEG_heatmap.png")
# create colors for map
ht_colors <- sigident::color_heatmap(sample_metadata = sample_metadata,
                                     study_metadata = study_metadata,
                                     targetcol = targetcol,
                                     controlname = controlname) # cancer = red
sigident::plot_deg_heatmap(mergeset = mergeset,
                            genes = genes,
                            patientcolors = ht_colors,
                            filename = filename)
knitr::include_graphics(filename)

Some investigations may already be finished at this point, having yielded the DEGs. We provide two tables containing annotations such as gene symbols (DEG_info.csv) and the results from the limma analysis including log2 fold changes and adjusted p-values (DEG_results.csv).

deg_info <- sigident::export_deg_annotations(mergedset = mergedset,
                                            genes = genes,
                                            idtype = idtype)
data.table::fwrite(deg_info, paste0(csvdir, "DEG_info.csv"))
dim(deg_info)
#> [1] 699   5
knitr::kable(head(deg_info))
probe_ID gene_symbol gene_title genebank_accession entrez_id
1552318_at GIMAP1 GTPase, IMAP family member 1 AK091818 170575
1552509_a_at CD300LG CD300 molecule-like family member g NM_145273 146894
1552619_a_at ANLN anillin, actin binding protein NM_018685 54443
1552767_a_at HS6ST2 heparan sulfate 6-O-sulfotransferase 2 NM_147174 90161
1552797_s_at PROM2 prominin 2 NM_144707 150696
1553105_s_at DSG2 desmoglein 2 NM_001943 1829
deg_results <- sigident::limma_top_table(mergeset = mergeset,
                                        design = design,
                                        q_value = fdr)
data.table::fwrite(deg_results, paste0(csvdir, "DEG_results.csv"))
dim(deg_results)
#> [1] 699   3
knitr::kable(head(deg_results))
Probe ID logFC adj.q_value
217046_s_at 217046_s_at -4.210404 0
210081_at 210081_at -5.357901 0
209470_s_at 209470_s_at -4.554046 0
209469_at 209469_at -3.945938 0
206209_s_at 206209_s_at -4.396728 0
209904_at 209904_at -4.090455 0

Functional analysis

Gene enrichment

A common investigation after differential expression analysis is the functional annotation of the DEGs. Furthermore, it is useful to find out whether the DEGs are associated with biological processes or molecular functions. The functional analysis supports GO annotation from an OrgDb object.

Test for over-representation

For the identification of enriched GO terms and KEGG pathways, an over-representation analysis is performed based on linear models. As input, the Entrez IDs and the definition of the species are needed. extract_go_terms and extract_kegg_terms output tables containing the most significant top GO terms and KEGG terms, respectively. In either case, rows are sorted by the minimum p-value with
- Term: GO term
- (Pathway: KEGG pathway)
- Ont: ontology that GO term belongs to
- N: number of genes in the GO term
- DE: number of genes in the input Entrez Gene IDs (deg_entrez)
- P.DE: p-value for over-representation of the GO term in the set

Term Ont N DE P.DE
GO:0044444 cytoplasmic part CC 9299 8712 0
GO:0005737 cytoplasm CC 10896 10117 0
GO:0005515 protein binding MF 11329 10693 0
GO:0005488 binding MF 14625 13242 0
GO:0003674 molecular_function MF 16969 14960 0
GO:0005622 intracellular CC 13992 12445 0
Pathway N DE P.DE
path:hsa05200 Pathways in cancer 531 515 0
path:hsa01100 Metabolic pathways 1487 1380 0
path:hsa04151 PI3K-Akt signaling pathway 354 341 0
path:hsa04360 Axon guidance 181 179 0
path:hsa05165 Human papillomavirus infection 330 318 0
path:hsa04810 Regulation of actin cytoskeleton 213 209 0

The following test for over-representation is based on the same method but additionally accounts for over-expression and under-expression in enriched terms with
- Term: GO term
- Ont: ontology that GO term belongs to
- N: number of genes in the GO term
- Up: number of up-regulated DEGs
- Down: number of down-regulated DEGs
- P.Up: p-value for over-representation of GO term in up-regulated genes
- P.Down: p-value for over-representation of GO term in down-regulated genes

Term Ont N Up Down P.Up P.Down
GO:0043231 intracellular membrane-bounded organelle CC 9294 4652 3261 0 0
GO:0005622 intracellular CC 12445 5898 4493 0 0
GO:0044424 intracellular part CC 12445 5898 4493 0 0
GO:0070013 intracellular organelle lumen CC 4844 2715 1678 0 0
GO:0031974 membrane-enclosed lumen CC 4844 2715 1678 0 0
GO:0043233 organelle lumen CC 4844 2715 1678 0 0
Pathway N Up Down P.Up P.Down
path:hsa05200 Pathways in cancer 515 230 258 0.6344687 0.0000000
path:hsa03013 RNA transport 143 113 41 0.0000000 0.8632953
path:hsa04380 Osteoclast differentiation 126 37 86 0.9999171 0.0000000
path:hsa04010 MAPK signaling pathway 283 113 158 0.9710721 0.0000000
path:hsa04144 Endocytosis 234 94 136 0.9513970 0.0000000
path:hsa04510 Focal adhesion 194 85 115 0.6883347 0.0000000
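
The P.DE values in the tables above stem from over-representation tests. At their core, such tests resemble a hypergeometric tail test, sketched here with invented counts (a simplification of the linear-model-based method the package actually uses):

```r
# Over-representation of a GO term among the DEGs (all counts invented)
N_universe <- 20000   # annotated genes in the universe
N_term     <- 300     # genes annotated to the term (column N)
n_degs     <- 700     # DEGs submitted (deg_entrez)
de_in_term <- 40      # DEGs falling inside the term (column DE)

# P(X >= de_in_term) where X ~ Hypergeometric(N_term, N_universe - N_term, n_degs)
p_de <- phyper(de_in_term - 1, N_term, N_universe - N_term, n_degs,
               lower.tail = FALSE)
```

With 700 DEGs one would expect roughly 700 * 300 / 20000 = 10.5 hits in the term by chance; observing 40 therefore yields a very small p-value.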

GO enrichment analysis

GO enrichment analysis of a given set of genes (deg_entrez) with the defined organism is based on the fitted linear models of the object enr_fitlm as output of the go_diff_reg function.
goEnrichmentAnalysis provides the visualization of the desired KEGG pathway ID (in this case hsa04151, the PI3K-Akt pathway), including under-expressed (red) and over-expressed (green) genes inside the pathway, as png images in the folder named ‘plots’ (object ‘plotdir’).

Using plot_enriched_barplot, a bar plot is created depicting the enrichment scores (adj. p-value) as bar color and gene count as bar height. The parameter type controls whether to depict enriched GO terms or enriched KEGG terms.

Identification of diagnostic signatures

A common task regarding gene expression data is to distinguish between different groups or subtypes. In order to train a multi-gene classifier, a large amount of expression data is required. Due to the imbalance of features (p) versus observations (n), the application of an appropriate method is needed.
The lasso regression and elastic net regularization are generalized linear models which are powerful and versatile methods for feature selection and are well suited in order to analyse genomic data with p >> n.
Therefore, we implemented the elastic net regularization method for the identification of a diagnostic multi-gene signature.

Split data into training and test dataset

A statistical prediction model is usually trained on 70%-80% of the present data. create_training_test_split randomly splits the merged dataset into a training and a test set according to the previously defined value split, maintaining a balanced distribution of the groups (here diagnosis). For reproducibility, a seed must be set.
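
The balanced (stratified) split can be sketched in base R: sample the split fraction within each group separately, so both classes keep their proportions (group sizes follow the example studies; the object names are illustrative):

```r
set.seed(111)  # seed for reproducibility, as required by the package
# group label per sample: 170 non-tumor, 197 tumor, as in the merged data
diagnosis <- factor(rep(c("Control", "Lung Cancer"), times = c(170, 197)))
split <- 0.7   # fraction of samples used for training

# sample 70% within each group -> stratified training indices
train_idx <- unlist(lapply(levels(diagnosis), function(lv) {
  idx <- which(diagnosis == lv)
  sample(idx, size = floor(split * length(idx)))
}))
test_idx <- setdiff(seq_along(diagnosis), train_idx)
```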

Lasso regression

Regularization requires the selection of a tuning parameter (lambda) that determines the strength of regularization. The function signature performs an nfolds cross-validation to select the best lambda and additionally builds the predictive model. seed is again set for reproducibility. The plot depicts the mean squared error as the criterion for the 10-fold cross-validation and shows the optimal values of lambda and the corresponding number of selected features (genes) above the plot. The left dashed vertical line represents the log of the optimal value of lambda (‘lambda min’), which results in the lowest prediction error and hence the most accurate model. The right dashed vertical line indicates the log of the lambda value that yields the model with the smallest number of included variables while still lying within one standard error of the optimal value of lambda (‘lambda 1se’).
Furthermore, a ROC curve is plotted to visualize the performance measurement of the model.
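
The ‘lambda min’ / ‘lambda 1se’ rule can be made concrete with toy cross-validation results (the numbers below are invented, not cv.glmnet output): lambda.1se is the largest lambda whose CV error stays within one standard error of the minimum.

```r
lambda <- c(0.20, 0.10, 0.05, 0.02, 0.01)   # regularization path
cv_err <- c(0.40, 0.30, 0.25, 0.24, 0.26)   # mean CV error per lambda
cv_se  <- c(0.02, 0.02, 0.02, 0.02, 0.02)   # standard error per lambda

i_min      <- which.min(cv_err)
lambda_min <- lambda[i_min]   # lowest prediction error

# largest lambda (sparsest model) with error <= min error + one SE
lambda_1se <- max(lambda[cv_err <= cv_err[i_min] + cv_se[i_min]])
```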

Elastic net regression

The elastic net is a mixture of lasso and ridge regularization, shrinking coefficients (as in ridge regression) while setting some coefficients to zero (as in lasso). The mixing parameter alpha therefore needs to be set to a value between 1 (lasso) and 0 (ridge). A grid search to calculate the optimal alpha and lambda can be conducted (as done next) but may take considerable computing capacity; hence we propose setting alpha manually to 0.9.
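
The role of alpha can be seen directly in the glmnet-style elastic net penalty, written out here as a small helper (illustrative, not a package function):

```r
# elastic net penalty: lambda * ((1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1)
enet_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(-2, 0, 1)
# alpha = 1 recovers the pure lasso (L1) penalty,
# alpha = 0 the pure ridge (L2/2) penalty
lasso_part <- enet_penalty(beta, lambda = 0.1, alpha = 1)
ridge_part <- enet_penalty(beta, lambda = 0.1, alpha = 0)
```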

Gridsearch to find alpha and lambda

Setting type = "grid" in the function signature conducts a grid search, using the R package caret, to calculate a pair of alpha and lambda for the ‘glmnet’ method that minimizes the CV error.

Analysis of diagnostic models

Comparing model performance

These models can be included in a list, and the performance metrics for classification can then be printed using the function compare_diagnostic_models(). This list includes all information of the respective models, which can be accessed via the following structure.

Model AUC
lasso.min 0.981
lasso.1se 0.983
elasticnet.min 0.981
elasticnet.1se 0.983
grid 0.976

Extract lambda values

The function get_diagnostic_lambda_values was designed to extract the results of the cross-validation.

Model Lambda min Lambda 1se
lasso 0.009260 0.059521
elasticnet 0.010288 0.063129
grid (best tune, alpha, lambda) 0.100000 0.521401

Model performance measurements

Additionally, the confusion matrix and the calculated performance metrics can be printed out for each model.

Validation of diagnostic signatures

The validation of diagnostic signatures is conducted using expression data of the studies “GSE30219”, “GSE33356” and “GSE102287”. We therefore need to create a list, holding meta information about the respective studies.

All models can be provided via the argument ‘model’ to the function validate_diagnostic_signature using the previously created list diagnostic_models.

You have to create a list for each validation dataset that contains the specifications of the validation study (below, this object is passed as validationstudylist). The list structure is as follows:
- studyname: the GEO id of the validation study
- setid: the index of the provided GEO set to be used
- targetcolname: the name of the phenoData column containing the target variable
- controllevelname: the level name of the controls in the column specified in targetcolname
- targetlevelname: the level name of the subjects of interest in targetcolname

GSE30219

GSE33356
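
Following the structure described above, a minimal validationstudylist for this study might look like the following sketch; the phenoData column name and level names are assumptions about GSE33356, not verified values:

```r
# Illustrative validationstudylist (column/level names are assumptions)
validationstudylist <- list(
  "GSE33356" = list(               # studyname as the key
    setid = 1,                     # which GEO set of the series to use
    targetcolname = "tissue:ch1",  # assumed phenoData column
    controllevelname = "normal",   # assumed control level name
    targetlevelname = "tumor"      # assumed target level name
  )
)
```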

GSE33356_validation <- sigident::validate_diagnostic_signature(
  validationstudylist = validationstudylist,
  models = diagnostic_models,
  genes = genes,
  idtype = idtype,
  targetname = targetname,
  controlname = controlname,
  targetcol = targetcol,
  datadir = datadir
)
#> Found 2 file(s)
#> GSE33356-GPL570_series_matrix.txt.gz
#> Parsed with column specification:
#> cols(
#>   .default = col_double(),
#>   ID_REF = col_character()
#> )
#> See spec(...) for full column specifications.
#> Using locally cached version of GPL570 found here:
#> ./geodata/data//GPL570.soft
#> Warning: 62 parsing failures.
#>   row     col           expected    actual         file
#> 54614 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> 54615 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> 54616 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> 54617 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> 54618 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> ..... ....... .................. ......... ............
#> See problems(...) for more details.
#> GSE33356-GPL6801_series_matrix.txt.gz
#> Parsed with column specification:
#> cols(
#>   .default = col_character()
#> )
#> See spec(...) for full column specifications.
#> File stored at:
#> ./geodata/data//GPL6801.soft
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases

GSE102287

GSE102287_validation <- sigident::validate_diagnostic_signature(
  validationstudylist = validationstudylist,
  models = diagnostic_models,
  genes = genes,
  idtype = idtype,
  targetname = targetname,
  controlname = controlname,
  targetcol = targetcol,
  datadir = datadir
)
#> Found 2 file(s)
#> GSE102287-GPL23871_series_matrix.txt.gz
#> Parsed with column specification:
#> cols(
#>   .default = col_double(),
#>   ID_REF = col_character()
#> )
#> See spec(...) for full column specifications.
#> File stored at:
#> ./geodata/data//GPL23871.soft
#> GSE102287-GPL570_series_matrix.txt.gz
#> Parsed with column specification:
#> cols(
#>   .default = col_double(),
#>   ID_REF = col_character()
#> )
#> See spec(...) for full column specifications.
#> Using locally cached version of GPL570 found here:
#> ./geodata/data//GPL570.soft
#> Warning: 62 parsing failures.
#>   row     col           expected    actual         file
#> 54614 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> 54615 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> 54616 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> 54617 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> 54618 SPOT_ID 1/0/T/F/TRUE/FALSE --Control literal data
#> ..... ....... .................. ......... ............
#> See problems(...) for more details.
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases

Identification of prognostic signatures

Create a list that contains specifications of the study/studies that contain(s) survival time information

For the calculation of a prognostic signature, it is first required to transform the phenoData containing the relevant survival data. In our example, only the study GSE19188 contains information on a subject's survival.
In order to make the subsequent analyses possible, the list discoverystudies_w_timedata, containing meta-information about the discovery study with survival data, needs to be created. This nested list needs to contain the following elements:
- the studyname, which is the key for a list containing
+ setid: the index of the provided GEO set to be used
+ timecol: the phenoData column name of the survival time
+ status: a list, containing the elements statuscol (name of phenoData column containing the survival status), levels (a list containing the originally used level-names of statuscol, namely alive, deceased, na)
+ targetcolname: the name of the phenoData column containing the target-variable
+ controllevelname: the level name of the controls of the column, specified in targetcolname
+ targetlevelname: the level name of the subjects of interest in targetcolname
+ use_rawdata: a logical value (TRUE or FALSE) that indicates, if the study is imported from the raw data
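
Following this structure, the list for GSE19188 might be sketched as follows. The column names match the phenoData columns shown for eset3b above (“overall survival:ch1”, “status:ch1”, “tissue type:ch1”), while the level-name values are assumptions:

```r
# Illustrative discoverystudies_w_timedata (level names are assumptions)
discoverystudies_w_timedata <- list(
  "GSE19188" = list(                        # studyname as the key
    setid = 1,
    timecol = "overall survival:ch1",       # survival time column
    status = list(
      statuscol = "status:ch1",             # survival status column
      levels = list(alive = "alive",        # assumed original level names
                    deceased = "deceased",
                    na = "Not available")
    ),
    targetcolname = "tissue type:ch1",
    controllevelname = "healthy",           # assumed level name
    targetlevelname = "tumor",              # assumed level name
    use_rawdata = TRUE                      # import from raw CEL files
  )
)
```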

Since we already used the raw data above to load the study ‘GSE19188’ for the DEG analysis, the enrichment analysis and the computation of the diagnostic signature, we here again import the study from the raw CEL files.

Identification of survival correlated genes

Prognostic classifier and Kaplan-Meier estimator

In order to build an unbiased classifier that separates subjects into high-risk and low-risk groups, expression data from the remaining discovery studies are required.

Apply the prognostic classifier on validation dataset

In order to apply the prognostic classifier to validation datasets, another list containing meta-information about these validation studies is required. This list (validationstudiesinfo) has the same structure as the list discoverystudies_w_timedata described above.
However, the element use_rawdata is not allowed here. Furthermore, it is allowed to set targetlevelname to ‘NULL’, which indicates that all included samples are of the class of interest and are taken into account when performing the analyses.

An alternative workflow: four generic functions for signature identification

In order to simplify the workflow described above, we wrapped all these functions into four big functions.

Preprocessing:

The preprocessing of the data needs to conform with the above described approach.

Create Diagnosis, Design and Batch

Create mergeset

Run sigidentDEG-function

Run sigidentEnrichment-function

Run sigidentDiagnostic-function

Run sigidentPrognostic-function

References

[1] W.E. Johnson, C. Li, and A. Rabinovic. Adjusting batch effects in microarray data using empirical bayes methods. Biostatistics, 8(1):118–127, 2007.